import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
clrs = ['#025159', '#03A696', '#F28705', '#F25D27', '#F20505']
Author Roman Studer
This notebook is dedicated to the explorative data analysis of the data set used in the Challenge FS20C7 - Recommendations in Retail. Since the execution of the calculations to create the graphics in this notebook is rather long, a PDF and HTML version is available here. Please note that in the PDF the interactivity of the graphics is lost.
Import and first look at Dataset
df = pd.read_csv('Recommender4Retail.csv', index_col='Unnamed: 0');
df.head()
df.tail()
df.columns
The following columns are included in the dataset:
Where SET is one of the four following evaluation sets (eval_set in orders):
# shape
f'Dimensionality of the dataset {df.shape}'
Dropping columns and deal with missing data
The following columns are not included in the analysis:
df = df.drop(['eval_set','order_number','add_to_cart_order','order_dow','order_hour_of_day', 'days_since_prior_order'], axis=1)
# checking datatypes
df.dtypes
Data types per feature are correct.
Rows with missing values can no longer be used. Since these are individual transactions that at first glance do not appear to have any connection (apart from possible customer preferences) we do not impute missing data.
df.drop_duplicates(inplace=True)
df.isna().sum()
Visual analysis of the distribution of products in the dataset.
df_department = df.groupby(by='department')['product_id'].nunique().reset_index()
color = [clrs[0],]*len(df_department.department.unique())
color[14], color[15] = '#8c8c8c', '#8c8c8c'
fig = go.Figure(data=[go.Bar(
x=df_department.department,
y=df_department.product_id,
marker_color=color, # marker color can be a single color value or an iterable
)]).update_xaxes(categoryorder='total descending')
fig.update_layout(title_text='Number of products per department',
xaxis_title="Department",
yaxis_title="Count")
fig.show()
The products are divided into 19 departments. In the graphic above, the departments Pantry, Personal Care, Snacks, Beverages and Frozen are particularly prominent. It can also be seen that there are products that cannot be assigned to a specific department. These 548 products fall into the category 'other'. A larger proportion (1258 products) cannot even be assigned to the category 'other' and were given the designation 'missing'. missing' and 'other' are greyed out in the graphic.
# products per aisle
df_aisle = df.groupby('aisle')['product_id'].nunique().reset_index()
fig = px.bar(df_aisle,
x='aisle',
y='product_id',
title='Number of products per aisle',
labels = {'product_id':'Count', 'aisle':'Aisle'},
color_discrete_sequence = [clrs[0]]).update_xaxes(categoryorder='total descending')
fig.show()
The individual aisles of the departments, listed by number of products, show a range from 1258 up to 12 products per aisle.
df_dep_aisle = df.groupby(by=['department','aisle'])['product_id'].nunique().reset_index()
fig = px.treemap(df_dep_aisle,
path=['department','aisle'],
values='product_id',
title='Ratio of orders in departments and aisles',
color='department')
fig.show()
Product groups (Aisles) in the departments can be visualized well in a treemap. It is well visible that the top 5 departments contain slightly more than half of the available product range. To focus on a department, click on it.
# most ordered products
most_ordered = df['product_name'].value_counts()
# top 20 products
print(f'Top 20 Products: \n{most_ordered[:20]}')
At first glance, the top products appear to be organic products such as fruits and vegetables. At the very top, the banana.
Visual analysis of the distribution of orders in the dataset.
# Number of products in orders
df_n_order = df.groupby(by=['order_id'])['product_id'].count().reset_index()
n_orders = df_n_order.product_id.value_counts()
fig = px.bar(x=n_orders.index,
y=n_orders.values,
title='Number of products in a shopping cart',
labels={'x':'Number of products in basket','y':'Number of orders'},
color_discrete_sequence=[clrs[1]])
fig.show()
The upper graphic shows how often a shopping cart contains a certain number of products. You can see here that there are orders with more than 140 products. The vast majority, however, have between one and twenty products in their shopping cart. The peak lies at five products with over 237 thousand orders containing this number of products.
# orders per customer
df_c_order = df.groupby(by=['order_id'])['product_id'].count().reset_index()
fig = px.box(data_frame=df_c_order,
x='product_id',
title='Boxplot, number of products in basket',
labels={'x':'Number of products in order'},
color_discrete_sequence=[clrs[1]],
orientation='h')
fig.show()